Lecture 3.4 - Hypothesis Test Wisdom
Hypothesis Testing Wisdom
Setting Expectations
Today we are working with a dataset on wages in the U.S. collected by the US Department of Labor: us.dol.wages.csv
We will first assume that the dataset approximately represents all workers in the U.S., so that its means and proportions can serve as stand-ins for the true population parameters.
- Subset your data and take a sample of 100 from the South region only, using the slice_sample() command from dplyr (see its documentation):
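library(dplyr)  # provides %>%, filter(), and slice_sample()
south_sample <- us.dol.wages %>%
  filter(south == "yes") %>%               # keep South-region workers only
  slice_sample(n = 100, replace = TRUE)    # draw 100 rows, with replacement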
- Write down your expectations about the following variables and whether and how they might be different than the population. Use Google if needed.
ed - education (years)
wage - salary per week
bluecol - whether the worker works in a blue collar job
union - whether the person is in a union
You can find the proportions/means of these variables with the table() and summary() commands.
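As a minimal sketch of those two commands, the block below builds a made-up stand-in for south_sample (the values are simulated, and the "yes"/"no" coding of bluecol and union is an assumption based on how south is coded) and then summarises it. Swap in your real sample before drawing any conclusions.

```r
# Made-up stand-in for south_sample; replace with your real sample.
set.seed(1)
south_sample <- data.frame(
  ed      = sample(8:17, 100, replace = TRUE),
  wage    = round(runif(100, min = 300, max = 1500)),
  bluecol = sample(c("yes", "no"), 100, replace = TRUE),
  union   = sample(c("yes", "no"), 100, replace = TRUE)
)

summary(south_sample$ed)     # min, quartiles, median, and mean of a numeric variable
mean(south_sample$wage)      # sample mean on its own

bluecol_counts <- table(south_sample$bluecol)  # counts per category
bluecol_counts
union_props <- prop.table(table(south_sample$union))  # counts converted to proportions
union_props
```

prop.table() is used here because table() alone returns counts, not proportions.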
Confidence intervals, \(p\) values, and \(\alpha\) values
First, consider the issue of an \(\alpha\) value.
What, in your opinion, is a reasonable choice for the \(\alpha\) value (i.e., how much evidence would you need to be convinced there was a ‘real’ difference between the sample and the population)? Write down some reasons for your choice.
Now, generate alternative and null hypotheses, fully specifying the \(\alpha\) level and the tailed-ness of the test for the following variables. The null hypothesis should be the overall population proportions/means.
ed
wage
bluecol
union
Make a note of why you chose a one- or two-tailed hypothesis test.
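As a template for writing these down, a two-tailed test for mean education and a one-tailed test for the union proportion could be stated as below. The symbols are introduced here only for illustration: \(\mu_{\text{ed}}\) and \(p_{\text{union}}\) are the Southern mean and proportion, \(\mu_0\) and \(p_0\) are the overall population values, and \(\alpha = 0.05\) is just one possible choice. Your own \(\alpha\) and direction may differ.

```latex
\[
H_0: \mu_{\text{ed}} = \mu_0
\qquad
H_A: \mu_{\text{ed}} \neq \mu_0
\qquad
\alpha = 0.05 \quad \text{(two-tailed)}
\]
\[
H_0: p_{\text{union}} = p_0
\qquad
H_A: p_{\text{union}} < p_0
\qquad
\alpha = 0.05 \quad \text{(one-tailed)}
\]
```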
Next, make a small table with both the \(p\) value and the confidence interval for each of these variables. Check the conditions for hypothesis testing.
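One possible way to build such a table is sketched below with t.test() for a mean and prop.test() for a proportion. The data are simulated and the null values pop_mean_ed and pop_prop_union are hypothetical placeholders, not figures from the dataset. Replace them with your sample and the overall population values.

```r
set.seed(2)
# Made-up stand-ins for two columns of south_sample.
ed_sample    <- sample(8:17, 100, replace = TRUE)
union_sample <- sample(c("yes", "no"), 100, replace = TRUE, prob = c(0.3, 0.7))

# Hypothetical null values; use the overall dataset's mean/proportion instead.
pop_mean_ed    <- 12.8
pop_prop_union <- 0.36

# Condition check for the proportion test: expected successes and failures >= 10.
n <- length(union_sample)
conditions_ok <- n * pop_prop_union >= 10 && n * (1 - pop_prop_union) >= 10

# One-sample t test for a mean, two-tailed, 95% confidence interval.
ed_test <- t.test(ed_sample, mu = pop_mean_ed,
                  alternative = "two.sided", conf.level = 0.95)

# One-sample test for a proportion against the null value.
union_test <- prop.test(x = sum(union_sample == "yes"), n = n,
                        p = pop_prop_union, conf.level = 0.95)

results <- data.frame(
  variable = c("ed", "union"),
  estimate = c(unname(ed_test$estimate), unname(union_test$estimate)),
  p_value  = c(ed_test$p.value, union_test$p.value),
  ci_low   = c(ed_test$conf.int[1], union_test$conf.int[1]),
  ci_high  = c(ed_test$conf.int[2], union_test$conf.int[2])
)
results
```

Note that prop.test() applies a continuity correction by default, so its interval can differ slightly from a hand-computed one.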
Interpret your \(p\) values and confidence intervals with respect to your \(\alpha\) level. In the end, which of the variables do you believe is statistically significantly different from the population, based on your sample?
Add two additional columns to your table with the ‘true’ value of the variables for just the south region and the ‘true’ value of your variables for the overall dataset. Were your conclusions correct or not?
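The two ‘true’ columns can be computed directly from the full dataset. The sketch below uses a simulated stand-in for us.dol.wages (made-up values, with "yes"/"no" coding assumed for union) so that it runs on its own. With the real data, skip the simulation step.

```r
set.seed(3)
# Made-up stand-in for us.dol.wages; replace with the real dataset.
us.dol.wages <- data.frame(
  south = sample(c("yes", "no"), 1000, replace = TRUE, prob = c(0.3, 0.7)),
  ed    = sample(8:17, 1000, replace = TRUE),
  union = sample(c("yes", "no"), 1000, replace = TRUE)
)

# All South-region rows, not just the 100-person sample.
south_only <- subset(us.dol.wages, south == "yes")

true_values <- data.frame(
  variable      = c("ed", "union"),
  true_south    = c(mean(south_only$ed),   mean(south_only$union == "yes")),
  true_overall  = c(mean(us.dol.wages$ed), mean(us.dol.wages$union == "yes"))
)
true_values
```

Taking mean() of a logical vector such as union == "yes" gives the proportion of "yes" values.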
Errors & Effect Size
For the same variables as those listed above, put yourself in the shoes of a policymaker that is considering additional government programs to help people in the south region if it can be shown that they are different on some important variables.
Add another column to your table. Assess whether you think a Type I or Type II error would be more serious for each of the variables. Provide a justification for why.
Add another column to your table with the size of the difference. Assess whether the difference is substantively large or not. How do you know if that is a large difference? You may want to check Google, etc. to see what the normal range of variation for these variables is.
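One way to express the size of a difference is sketched below: the raw difference in the variable's own units, plus a standardized difference (a one-sample Cohen's d, which is a technique brought in here and not one the worksheet prescribes). The wage values and pop_mean_wage are made up for illustration.

```r
set.seed(4)
# Made-up stand-in for south_sample$wage.
wage_sample   <- rnorm(100, mean = 850, sd = 300)
pop_mean_wage <- 900   # hypothetical population mean; use the overall dataset's value

# Raw difference, in the same units as wage.
raw_diff <- mean(wage_sample) - pop_mean_wage

# Standardized difference: raw difference divided by the sample standard deviation.
std_diff <- raw_diff / sd(wage_sample)

c(raw_diff = raw_diff, std_diff = std_diff)
```

Cohen's rough benchmarks (about 0.2 small, 0.5 medium, 0.8 large) are only a rule of thumb. Whether a difference matters for policy still depends on what the units mean in practice.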
Overall, with your partner, write up a summary paragraph with the results of your findings and how we should interpret the results of your calculations and thoughts based on your 100-person sample of Southern workers.
Extra time
- Develop a regression model that predicts wage. Is the variable south an important predictor in your model? Why or why not?
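A starting point for such a model might look like the sketch below, which fits a linear regression with lm(). The data frame is simulated, and the rule that generates wage is invented purely so the example runs. The choice of predictors is also only an illustration.

```r
set.seed(5)
n <- 500
# Made-up stand-in for us.dol.wages; replace with the real dataset.
us.dol.wages <- data.frame(
  ed    = sample(8:17, n, replace = TRUE),
  south = sample(c("yes", "no"), n, replace = TRUE),
  union = sample(c("yes", "no"), n, replace = TRUE)
)
# Invented wage-generating rule, only for illustration.
us.dol.wages$wage <- 300 + 50 * us.dol.wages$ed -
  60 * (us.dol.wages$south == "yes") + rnorm(n, sd = 150)

# Linear model of wage on education, region, and union membership.
wage_model <- lm(wage ~ ed + south + union, data = us.dol.wages)

coefs <- summary(wage_model)$coefficients  # estimate, std. error, t value, p value
coefs["southyes", ]                        # row for south == "yes" relative to "no"
confint(wage_model)["southyes", ]          # 95% confidence interval for that coefficient
```

The row name southyes arises because R treats the character column south as a factor with "no" as the reference level. Judge importance from both the size of the coefficient and its uncertainty, not the \(p\) value alone.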